MapReduce as a programming model

A Mapper function: Applies to every element in the collection


In [9]:
def fun(x):
    return x**2

A Reducer function: An arbitrary associative binary operation


In [12]:
def add(x,y):
    return x+y

def maximum(x,y):
    return max(x,y)

def minimum(x,y):
    return min(x,y)

def prod(x,y):
    return x*y

def selective_division(x,y):
    return x/y if x>y else y/x

Conventional sequential program


In [10]:
L = [1,3,4,2,7]

s = 0
for a in L:
    s = s+a**2
print(s)


79

The same calculation using the map primitive.


In [26]:
S = map(fun, L)
print(S)
reduce(add, S)


[1, 9, 16, 4, 49]
Out[26]:
79

No need to store the intermediate result: opens up the possibility for lazy evaluation


In [13]:
reduce(add, map(fun, L))


Out[13]:
79

In [15]:
selective_division(89, 65)


Out[15]:
1

In [28]:
#print(L)
print(S)

#print reduce(selective_division, S)

#print reduce(add, map(fun, L))

#print reduce(minimum, S)
print reduce(maximum, S)


[1, 9, 16, 4, 49]
49

Number of even entries


In [6]:
def is_even(x):
    return 1 if x%2==0 else 0

In [7]:
reduce(add, map(is_even, L))


Out[7]:
2

Calculating the average:


In [17]:
def one(x):
    return 1

In [20]:
reduce(add, L)/float(reduce(add, map(one, L)))


Out[20]:
3.4

Compute the average


In [30]:
A = ['Ahmet', 'Mehmet', 'Veli', 'Aras'] 

def is_A(x):
    return 1 if x[0]=='A' else 0

reduce(lambda x,y: x+y, map(is_A, A))

#reduce(add, A)


Out[30]:
2

In [31]:
def print_and_add(x,y):
    print(x,y)
    return x+y

In [32]:
L = range(16)

reduce(print_and_add, L)


(0, 1)
(1, 2)
(3, 3)
(6, 4)
(10, 5)
(15, 6)
(21, 7)
(28, 8)
(36, 9)
(45, 10)
(55, 11)
(66, 12)
(78, 13)
(91, 14)
(105, 15)
Out[32]:
120

Spark

RDD : Resilient Distributed Dataset

Note: This notebook version needs the Python 2.7. When the notebook server is launched from a 3.x environment, the environment variables are not set correctly.

Load a text file on the HDFS as a RDD

  • Display number of lines

In [38]:
#import findspark
#findspark.init()
#import pyspark
#sc = pyspark.SparkContext(appName="Deneme")

import sys
import numpy as np

#textFile = sc.textFile("data/books-eng/hamlet.txt")
textFile = sc.textFile("data/books-eng")
textFile.count()


Out[38]:
133228

Get the first line


In [39]:
textFile.first()


Out[39]:
u'ACT I'

Find number of lines where a word appears


In [40]:
word = "CLAUDIUS"
textFile.filter(lambda lin: word in lin).count()


Out[40]:
127

Get a random sample from the file


In [54]:
textFile.sample(withReplacement=False, fraction=0.05).first()


Out[54]:
u'The office and devotion of their view'

In [55]:
def prnt(x):
    print x

textFile.sample(withReplacement=False, fraction=0.05).take(10)


Out[55]:
[u'Flourish. Enter ANTONY, CLEOPATRA, her Ladies, the Train, with Eunuchs fanning her',
 u'Look, where they come:',
 u'CLEOPATRA ',
 u'Then must thou needs find out new heaven, new earth.',
 u'Grates me: the sum.',
 u'MARK ANTONY ',
 u"There's not a minute of our lives should stretch",
 u'CLEOPATRA ',
 u'The qualities of people. Come, my queen;',
 u'Exeunt MARK ANTONY and CLEOPATRA with their train']

In [59]:
textFile.take(10)


Out[59]:
[u'ACT I',
 u"SCENE I. Alexandria. A room in CLEOPATRA's palace.",
 u'',
 u'Enter DEMETRIUS and PHILO ',
 u'PHILO ',
 u"Nay, but this dotage of our general's",
 u"O'erflows the measure: those his goodly eyes,",
 u"That o'er the files and musters of the war",
 u"Have glow'd like plated Mars, now bend, now turn,",
 u'The office and devotion of their view']

Find the line with most words


In [60]:
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)


Out[60]:
122

In [66]:
textFile.map(lambda line: (len(line.split()),line) ).reduce(lambda a, b: a if (a[0] > b[0]) else b)


Out[66]:
(122,
 u"    [Enter a King and a Queen very lovingly; the Queen embracing him, and he her. She kneels, and makes show of protestation unto him. He takes her up, and declines his head upon her neck: lays him down upon a bank of flowers: she, seeing him asleep, leaves him. Anon comes in a fellow, takes off his crown, kisses it, and pours poison in the King's ears, and exit. The Queen returns; finds the King dead, and makes passionate action. The Poisoner, with some two or three Mutes, comes in again, seeming to lament with her. The dead body is carried away. The Poisoner wooes the Queen with gifts: she seems loath and unwilling awhile, but in the end accepts his love]")

In [61]:
'abc df'.split()


Out[61]:
['abc', 'df']

In [65]:
s = 'xyz'
pair = (len(s), s)

pair[0]


Out[65]:
3

Counting words


In [67]:
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
wordCounts.collect()


Out[67]:
[(u'Exasperated,', 2),
 (u'aided', 2),
 (u'drunkenness.', 2),
 (u'glad,"', 5),
 (u'Chichagov,', 8),
 (u"unvarnish'd", 1),
 (u'Danois', 1),
 (u'admiration?', 1),
 (u'forges', 3),
 (u'blisters', 2),
 (u'habilet\xe9;', 1),
 (u'brightening', 4),
 (u'ever?', 1),
 (u'travaux', 2),
 (u'Romeo,--', 1),
 (u'Kirilovich', 1),
 (u'pedestrian,', 1),
 (u'bourreau.', 1),
 (u'party.', 10),
 (u'rocking', 6),
 (u'piti\xe9!--Elle', 1),
 (u'booths.', 1),
 (u'looking', 415),
 (u'flash.', 3),
 (u'Senior', 1),
 (u'pointing', 78),
 (u'inheritance.', 2),
 (u'Sewer,', 1),
 (u'(His', 1),
 (u'larger,', 1),
 (u'contained', 7),
 (u"reckon'd.", 1),
 (u'Oka', 1),
 (u'mark-man!', 1),
 (u'donner?', 1),
 (u'Eh,', 7),
 (u'war!"', 1),
 (u'Craftsmen', 1),
 (u'"Encouraging', 1),
 (u'attack,"', 2),
 (u'enclosures,', 1),
 (u'1813.', 2),
 (u'nourisher', 1),
 (u'artilleryman', 4),
 (u'motifs', 1),
 (u'cozenage', 1),
 (u'be!"', 12),
 (u'today?', 3),
 (u'trumpeters,', 1),
 (u'blanket', 3),
 (u'breadth.', 2),
 (u'goats,', 2),
 (u'particulars.', 1),
 (u'powder.', 2),
 (u'CHRIS', 10),
 (u'smack', 2),
 (u'pacify', 2),
 (u'yellow', 25),
 (u'voix', 8),
 (u'patrie,', 1),
 (u'prodigies,', 1),
 (u'Ottoman.', 1),
 (u'postpone', 3),
 (u'vast', 36),
 (u'Lest', 15),
 (u'revelers', 1),
 (u'present!', 1),
 (u'baking', 1),
 (u'destitution,', 1),
 (u'skills', 2),
 (u'Pierre:', 5),
 (u'(R2-D2).........................................KENNY', 2),
 (u'spacecraft', 2),
 (u'relentlessly', 1),
 (u'ending,', 2),
 (u'blowing,', 1),
 (u'ivrogne.', 1),
 (u"hack'd", 1),
 (u'gracious:', 1),
 (u'RODERIGO.--Quoi!', 1),
 (u'stretchers.', 1),
 (u'glued', 2),
 (u'satisfaire,', 1),
 (u'stars!', 2),
 (u'anywhere!"', 1),
 (u'"mon', 2),
 (u'parted', 16),
 (u'Mamma."', 3),
 (u'Draws', 2),
 (u'scouted', 1),
 (u'BREMNER', 1),
 (u'first...', 1),
 (u'warns', 2),
 (u'smeared', 8),
 (u'parlons', 2),
 (u"Bitski's", 1),
 (u'Annette', 2),
 (u'equal,', 1),
 (u'blaspheme', 1),
 (u'rescript', 5),
 (u'deprived.', 1),
 (u'purged,', 2),
 (u'anybody,', 1),
 (u'cantina,', 2),
 (u'rope', 6),
 (u'gold!', 1),
 (u'him--felt', 1),
 (u'hide', 60),
 (u'children?"', 1),
 (u'records,', 1),
 (u'CATO,', 2),
 (u'beeping.', 1),
 (u'Spacecraft', 1),
 (u'finesse;', 1),
 (u'victims', 3),
 (u'astonishment.', 4),
 (u'unpleasant,', 1),
 (u'better;', 1),
 (u'See-Threepio', 4),
 (u'subordinate,', 1),
 (u'majesties', 2),
 (u'difficulty.', 8),
 (u'tips', 6),
 (u'supple,', 1),
 (u'grunting', 2),
 (u'kid.', 11),
 (u'suave,', 1),
 (u'insinuation', 1),
 (u'Stolypin', 2),
 (u'hero', 17),
 (u'10,', 1),
 (u'Conceit', 2),
 (u'Oh...', 3),
 (u'Amuse', 1),
 (u'feisty', 1),
 (u'Fortuna..............................................MICHAEL', 1),
 (u'(raw', 1),
 (u'wife--are', 1),
 (u'Brother.', 1),
 (u'backs,', 3),
 (u'darling....', 1),
 (u'glass.', 15),
 (u'clenching', 3),
 (u'battery?', 1),
 (u'shriek', 4),
 (u'beloved;', 1),
 (u'Fili', 4),
 (u'quieter,"', 1),
 (u'gests.', 1),
 (u'bellow.', 1),
 (u'glories;', 1),
 (u'bedding', 1),
 (u'worse.', 13),
 (u'Antonius,', 1),
 (u'Mary--"Haven\'t', 1),
 (u'fr\xe9n\xe9sie:--Tout', 1),
 (u'MAGIC', 2),
 (u'DESD\xc9MONA.--Ma', 2),
 (u'selvedges.', 1),
 (u'sombre', 3),
 (u'smoke', 90),
 (u'NOT', 4),
 (u'PORTMAN', 1),
 (u'settled', 54),
 (u'says,"', 2),
 (u'arbiters,"', 1),
 (u'punctilious', 1),
 (u'circumstance.', 1),
 (u'ridiculed', 2),
 (u'sugar.', 1),
 (u'Mantua;', 2),
 (u'straw', 12),
 (u'd\xe9sire', 1),
 (u'carack:', 1),
 (u'Were,', 1),
 (u"KENOBI'S", 1),
 (u'Pokrovsk,', 2),
 (u'us?!', 1),
 (u'Master!..."', 1),
 (u'for--I', 1),
 (u'resonant', 1),
 (u'v\xe9rit\xe9,', 16),
 (u'breath?', 4),
 (u'reassured.', 1),
 (u'anyhow', 3),
 (u'rareness,', 1),
 (u'Wookiee', 35),
 (u'Star', 139),
 (u'patchwork', 1),
 (u'leurs', 29),
 (u'XXXVIII', 1),
 (u'exacte', 1),
 (u'dispatches', 2),
 (u'hospital', 18),
 (u'exhort', 2),
 (u"diplomat's", 1),
 (u'blacksmiths', 1),
 (u'exception;', 2),
 (u'backing', 1),
 (u"saw't:", 1),
 (u'crafty,', 1),
 (u'strike', 35),
 (u'chewing', 4),
 (u'trot.', 4),
 (u'(softly)', 2),
 (u'used,', 3),
 (u'circonstance', 1),
 (u'presided', 1),
 (u'higher,', 5),
 (u'instructions...', 1),
 (u'property--that', 1),
 (u'barbarous', 1),
 (u'"Pretty,', 1),
 (u'unconsciousness.', 1),
 (u'ronflent;', 1),
 (u'jiggles', 1),
 (u'meat,', 5),
 (u"JAGO.--N'en", 1),
 (u'mettant', 1),
 (u'off?', 1),
 (u'Never!"', 1),
 (u'IAGO', 288),
 (u'GARTIN,', 1),
 (u'troops.', 37),
 (u'Likhachev;', 1),
 (u'ouvrent', 1),
 (u'bow.', 3),
 (u'FIGHTER', 38),
 (u'guest:', 1),
 (u'sped', 1),
 (u'DEBORAH', 1),
 (u'Settlement.', 1),
 (u'EVEN', 2),
 (u'Jena', 4),
 (u'word', 220),
 (u'uncovered', 4),
 (u'yourself...."', 3),
 (u'40', 1),
 (u'T-forty-', 1),
 (u'famed', 2),
 (u'Crown', 1),
 (u'laughter;', 1),
 (u'Copenhagen--a', 1),
 (u'majesty;', 2),
 (u'respect:', 2),
 (u'choose,', 2),
 (u'"Impudence!', 1),
 (u'dispirited.', 1),
 (u'underlings.', 1),
 (u'criticizing', 3),
 (u'ruddy,', 1),
 (u'Guildenstern;', 1),
 (u'haters.', 1),
 (u'vicomte,"', 1),
 (u'shore!', 1),
 (u'\xe9tiez', 1),
 (u'"Voyons,', 1),
 (u'"Wherefore?"', 1),
 (u'contented.', 2),
 (u'collision.', 1),
 (u'fichu', 2),
 (u'Charity,', 1),
 (u'"H-o-o-w', 1),
 (u"l'oreille", 4),
 (u'Napoleon;', 2),
 (u'someone?--he', 1),
 (u'underclothing,', 1),
 (u'fair!', 1),
 (u'Quant', 3),
 (u'Holding', 7),
 (u'DAVIS', 2),
 (u'organizer', 1),
 (u'workbag."', 1),
 (u'blank,', 1),
 (u'Salomoni!', 1),
 (u'landlady,', 1),
 (u'ceremony', 12),
 (u'rites?', 1),
 (u'Darklighter', 1),
 (u'dismantled,', 1),
 (u'Feoktist?"', 1),
 (u'LODOVICO.--Sa', 1),
 (u'night,--', 1),
 (u'machine', 20),
 (u'tuer.--En', 1),
 (u'ablaze.', 1),
 (u'desperately,', 6),
 (u'officier,', 1),
 (u'Recalling', 2),
 (u'd\xe9mon!', 2),
 (u"Tsarevich's", 1),
 (u'ordinary', 36),
 (u'III', 26),
 (u'gentleman"),', 1),
 (u"letter's", 1),
 (u'HORATIO,', 7),
 (u'reasonable."', 1),
 (u'laide', 2),
 (u"infant's", 2),
 (u'pure-white', 1),
 (u'Urusov', 1),
 (u'flair.', 1),
 (u'sullenly', 2),
 (u'CASSIO.--Tu', 1),
 (u'fellows:', 1),
 (u'Rostopchin,', 20),
 (u'Dialogue', 2),
 (u'Russen!', 1),
 (u'Apocalypse.', 1),
 (u'Will', 109),
 (u'hereupon', 1),
 (u'walkway,', 2),
 (u"d'amiti\xe9", 1),
 (u'food;', 3),
 (u'misunderstandings,', 1),
 (u'domed', 4),
 (u'cuisse', 2),
 (u'chilled', 2),
 (u'cold;', 3),
 (u'ch\xe8vre', 1),
 (u"confess'd.", 1),
 (u'Antony,', 50),
 (u'"Despite', 1),
 (u'revolt', 4),
 (u'neck...', 1),
 (u'beasts,"', 1),
 (u'unsmirched', 1),
 (u'chers', 2),
 (u'zoology', 1),
 (u'power..."', 1),
 (u'burgher', 1),
 (u'Frenchie!', 1),
 (u'KNOWLES', 1),
 (u'amis', 3),
 (u'vo\xfbte', 1),
 (u'opposition,"', 1),
 (u'Eye', 2),
 (u'sun,', 22),
 (u'peur', 2),
 (u'endormie,', 1),
 (u'goggles.', 2),
 (u'unwrung.', 1),
 (u'"Karay?', 1),
 (u'beau,', 1),
 (u'Moff', 8),
 (u'hibernation', 1),
 (u'surrendered', 9),
 (u'judge,"', 1),
 (u'plated', 1),
 (u'quelquefois', 3),
 (u'Rapidly', 2),
 (u'miscarry', 1),
 (u'General,', 9),
 (u'exhibition,', 1),
 (u'luck!"', 3),
 (u'glances', 33),
 (u'one-ruble', 1),
 (u'hobbledehoy', 1),
 (u'MARTY', 1),
 (u'merrier,', 1),
 (u'mettent', 1),
 (u'Attendants]', 11),
 (u'tears?', 3),
 (u'pains!', 1),
 (u'good-by', 6),
 (u'died;', 3),
 (u'bouquet', 1),
 (u'haquen\xe9es', 1),
 (u'routine,', 1),
 (u'spy.', 2),
 (u'savory', 4),
 (u'pressing,', 1),
 (u'bon!', 1),
 (u'crunch.', 1),
 (u'anxiety.', 5),
 (u'peu:', 1),
 (u'starships', 2),
 (u'fortunate:', 1),
 (u'bog.', 2),
 (u'affected', 16),
 (u'manifesto.', 1),
 (u'fragment', 6),
 (u'vainglorious', 1),
 (u'malice,', 2),
 (u'Dunyasha...."', 1),
 (u'MV-7.', 1),
 (u'meadow,', 6),
 (u'history.', 15),
 (u'HANSON,', 1),
 (u'JAN', 1),
 (u'Anon', 3),
 (u'Steps', 2),
 (u'sack', 3),
 (u'suitor', 7),
 (u'depression', 9),
 (u'confiant?', 1),
 (u'17th', 1),
 (u"Othello_.--Qu'avez-vous", 1),
 (u'senator.', 1),
 (u'manners,', 4),
 (u'exhaled', 1),
 (u'spotlighted', 1),
 (u'Fates', 1),
 (u'weather.', 3),
 (u'orderlies', 10),
 (u'mammas', 1),
 (u'Balashev', 71),
 (u'sold."', 1),
 (u'rendered,', 1),
 (u'WANNBERG', 2),
 (u'Potsdam,', 1),
 (u'Captains,', 1),
 (u"donne-m'en", 1),
 (u'OTHELLO.--Avez-vous', 2),
 (u'Decius,', 2),
 (u'organized', 5),
 (u'hems.', 1),
 (u'Uh,', 10),
 (u'WILLIAMS', 4),
 (u'"great"', 2),
 (u'twirled', 4),
 (u'lifted', 65),
 (u'tea?"', 5),
 (u'Solicit', 1),
 (u'assistants.', 3),
 (u'Verona.', 1),
 (u'XC.', 1),
 (u'bowls', 3),
 (u'supposed.', 1),
 (u"Walk'd", 1),
 (u'Medicine', 2),
 (u'communications,', 1),
 (u'Fahrenheit', 1),
 (u'laugh?', 1),
 (u'seminude', 1),
 (u'machinant', 1),
 (u'deviennent', 1),
 (u'ghosts', 3),
 (u'appurtenance', 1),
 (u'distinguee,', 1),
 (u"'Hurrah!'--a", 1),
 (u'damne-toi,', 1),
 (u'confineless', 1),
 (u'"Greatness,"', 1),
 (u'langue', 2),
 (u'smoked,', 3),
 (u'at,', 7),
 (u'headdress', 2),
 (u'glaive!--Encore', 1),
 (u'refuge!', 1),
 (u'Institute', 1),
 (u'speed?', 1),
 (u'Animatronics', 1),
 (u'start,', 5),
 (u'lain', 8),
 (u"Nothing's", 1),
 (u'wooingly', 1),
 (u'bed."', 2),
 (u'question;', 1),
 (u'nitrogen', 1),
 (u'clink;', 1),
 (u'dummy!"', 1),
 (u'jour!', 3),
 (u'supplies,', 2),
 (u'assured', 21),
 (u'derision.', 1),
 (u'bottommost', 1),
 (u'eternal', 23),
 (u'tourment\xe9', 2),
 (u'atmosphere."', 1),
 (u'Discussions', 1),
 (u'indices', 1),
 (u'XXX.', 1),
 (u'steepy', 1),
 (u'springtime', 1),
 (u'cloth,', 3),
 (u'Guardsman,', 1),
 (u'contingencies', 4),
 (u"l'amour", 14),
 (u'knit', 10),
 (u'souveraine,', 1),
 (u'Retire-toi;', 1),
 (u"if't", 5),
 (u'Emperor--without', 1),
 (u'waddled', 2),
 (u"know't;", 1),
 (u"roof'd,", 1),
 (u'LOFTHOUSE', 1),
 (u'"They\'ve', 6),
 (u'us--namely', 1),
 (u'lour', 1),
 (u'raise,', 1),
 (u'brooding', 3),
 (u"'Las,", 1),
 (u'local', 5),
 (u'filter', 2),
 (u'renew', 8),
 (u'old--the', 1),
 (u'features,', 7),
 (u'cane.', 1),
 (u'Human', 2),
 (u'comptes-y', 1),
 (u'Truck', 1),
 (u'tricks?', 1),
 (u'tactics', 4),
 (u'"Denisov!', 1),
 (u'gossips,', 2),
 (u'lathe.', 1),
 (u'Cooper', 3),
 (u'render', 12),
 (u'sea-bank', 1),
 (u'hillock.', 1),
 (u"s'accommodera", 1),
 (u'negative,', 1),
 (u'hour!', 2),
 (u'bombard', 2),
 (u'camps,', 4),
 (u'SPACECRAFT', 1),
 (u'"Awfully', 1),
 (u'Field.', 2),
 (u'delicious', 4),
 (u'separation!', 1),
 (u'"Marya', 4),
 (u'prolong\xe9', 1),
 (u'justes', 1),
 (u'moment..."', 1),
 (u'man--that', 1),
 (u'knight.', 1),
 (u'"Wait--just', 1),
 (u'lisping:', 1),
 (u'Wanna', 2),
 (u'continuing', 15),
 (u'ha-ve', 1),
 (u'attack', 107),
 (u'MADINE', 1),
 (u'shipwrights,', 1),
 (u'impudente', 1),
 (u'NICHOLS,', 2),
 (u'Uncle', 25),
 (u'tune', 13),
 (u'corns;', 1),
 (u'd\xe9p\xeache-toi.', 1),
 (u'father..."', 2),
 (u'purposefully', 2),
 (u'empty!', 1),
 (u'next--to', 1),
 (u"headquarters.'", 1),
 (u'unlearn', 1),
 (u'naught:', 2),
 (u'nightingale:', 1),
 (u'Morality.', 1),
 (u'moisture', 5),
 (u'Jobert', 2),
 (u'monotonous,', 1),
 (u'short-circuit.', 2),
 (u'travers', 2),
 (u'floor.', 27),
 (u'purses', 1),
 (u'prouder', 1),
 (u'Vaska,', 1),
 (u'Villanous', 2),
 (u'selected.', 2),
 (u'OTHELLO.--Allez', 2),
 (u'furlough', 3),
 (u'Wait.', 2),
 (u'battle--the', 1),
 (u'fiction,', 1),
 (u'positively', 7),
 (u'yours?"', 3),
 (u'Him!...', 1),
 (u'targes', 1),
 (u'creatures,', 4),
 (u'plate,', 5),
 (u'twenty,', 2),
 (u'exist....', 1),
 (u'joke...."', 1),
 (u'spurring', 3),
 (u"d'hypocrisie", 1),
 (u'hors', 5),
 (u'1977', 2),
 (u'body;', 1),
 (u'indecision,', 3),
 (u'guilt', 11),
 (u'boil', 3),
 (u'to,', 59),
 (u'Archduchy', 1),
 (u'audibly.', 1),
 (u'girlishly', 1),
 (u'talk--all', 1),
 (u'girl!', 5),
 (u'woman!"', 5),
 (u'WIPPERT', 1),
 (u'we?', 6),
 (u'"Heavens!', 2),
 (u"Kutaysov's", 1),
 (u'swallowed:', 1),
 (u'falcon"', 1),
 (u'subsided.', 1),
 (u'CLEOPATRA', 215),
 (u'Pavlovna--who', 1),
 (u'bloodshot,', 1),
 (u'mari.', 7),
 (u'tasted.', 2),
 (u'Willarski,', 8),
 (u'shy,', 6),
 (u'nonpareil.', 1),
 (u'swindled', 1),
 (u'staggeringly', 1),
 (u'jam,', 1),
 (u'leafy', 2),
 (u'intellect,', 4),
 (u'wagered', 2),
 (u'moist.', 3),
 (u"counterfeit'st", 1),
 (u'else--in', 1),
 (u'Staying', 2),
 (u'wholesome', 10),
 (u'audacity!', 1),
 (u'POV', 1),
 (u'often,', 5),
 (u'railings,', 1),
 (u'Manager', 7),
 (u'uttermost?', 1),
 (u'ensemble.)', 1),
 (u'Emperor.', 54),
 (u"pineapples.'", 1),
 (u'doors', 34),
 (u'impalpable,', 1),
 (u'Foh!', 1),
 (u'r\xe9p\xe9tait', 1),
 (u'blessed,', 2),
 (u'completed', 8),
 (u'Directed', 2),
 (u'allow?', 1),
 (u'Convert', 2),
 (u'feverishly.', 1),
 (u'lui.--O', 1),
 (u'panicked.', 1),
 (u'Redoubt--and', 1),
 (u'Brutus:', 3),
 (u'Alarm', 1),
 (u'see;', 3),
 (u'report?', 1),
 (u'cordes', 1),
 (u'JEAN', 1),
 (u'hunts,', 3),
 (u"l'emporte", 1),
 (u'(Though', 1),
 (u'saule.', 3),
 (u'bees.', 2),
 (u'content;', 2),
 (u'freely.', 4),
 (u'Went', 2),
 (u'difficulties,', 1),
 (u'feeble,', 5),
 (u'grave', 32),
 (u'pillows.', 3),
 (u'hut', 35),
 (u'initiator', 1),
 (u'don', 3),
 (u'Immoderately', 1),
 (u'unripe,', 1),
 (u'Unthrifty', 1),
 (u'medical', 9),
 (u'm', 3),
 (u'retreated.', 2),
 (u"d'accord,", 1),
 (u'investing', 1),
 (u'courrouc\xe9e', 1),
 (u'justifiability', 1),
 (u'Karl', 7),
 (u'another,"', 5),
 (u"me'?", 1),
 (u'thresh', 1),
 (u'letters;', 1),
 (u'flog', 2),
 (u'incomplete', 1),
 (u'duets', 1),
 (u'ETERNITY', 1),
 (u'memory),', 1),
 (u'Napoleon,"', 1),
 (u'bait', 4),
 (u'come."', 4),
 (u'wrathfully', 2),
 (u'insects', 2),
 (u'blazed', 1),
 (u'harness,', 5),
 (u'liest', 3),
 (u'him--evidently', 1),
 (u"enjoin'd", 1),
 (u'convulsively,', 1),
 (u'resuming', 3),
 (u'"This,', 1),
 (u'ROSS,', 9),
 (u'appease', 2),
 (u'Anne', 1),
 (u'succeed.', 1),
 (u'Rostova?"', 1),
 (u'lessons', 8),
 (u'ear.', 14),
 (u'submit', 14),
 (u'shot!', 2),
 (u'gantry.', 2),
 (u'vexations', 1),
 (u'Parle,', 1),
 (u'vilest,', 1),
 (u'Duchenois,', 1),
 (u'Bib', 19),
 (u'haste;', 4),
 (u'"Going', 2),
 (u'mistake,', 6),
 (u'hurrah!', 4),
 (u'bush', 2),
 (u'"Sideways!', 1),
 (u'Mary:', 1),
 (u'specify', 1),
 (u'everything...', 2),
 (u'dullness', 2),
 (u'stores', 8),
 (u'radically', 1),
 (u"dispatch'd:", 1),
 (u'reverently', 1),
 (u'precedence;', 1),
 (u"Kutuzov's.", 1),
 (u'"Enchantress,"', 1),
 (u'came;--with', 1),
 (u'ENOBARBUS]', 4),
 (u'predestined', 8),
 (u'"Weren\'t', 1),
 (u'fruits,', 1),
 (u'lies.', 8),
 (u'view:', 1),
 (u"Capel's", 2),
 (u'(Yoda', 1),
 (u'comply', 17),
 (u'malgr\xe9', 1),
 (u'maintenant?--Malheureuse', 1),
 (u'debt:', 2),
 (u'Fife', 1),
 (u'elope', 5),
 (u'men:', 5),
 (u'noon', 4),
 (u'post-horses;', 1),
 (u'"Love...', 1),
 (u'Grandeur', 1),
 (u'dipping', 2),
 (u'flight?', 1),
 (u'thrusts', 3),
 (u'remembers...', 1),
 (u'Norway;', 1),
 (u'stead', 1),
 (u'reviving', 1),
 (u'flanked', 4),
 (u"Vasili's", 25),
 (u'advance.', 14),
 (u'"man', 4),
 (u"'overresist'", 1),
 (u'Montague!', 2),
 (u"Wha-wha-what's", 1),
 (u'fascinating', 7),
 (u'impotent', 2),
 (u'yawned', 1),
 (u'admiral,', 1),
 (u'Seslavin.', 1),
 (u'guerre', 6),
 (u'malhabile', 1),
 (u"look'st", 4),
 (u'originated', 1),
 (u'calmer', 5),
 (u'flaxen', 1),
 (u"minute's", 1),
 (u'RODERIGO.--Bah!', 1),
 (u'twenty-two', 4),
 (u'material.', 2),
 (u'STOLZ,', 1),
 (u'Clouds', 2),
 (u'KNOWLTON', 1),
 (u'Hardly', 12),
 (u'Bourbon', 1),
 (u'granted?..."', 1),
 (u'you--the', 1),
 (u'demande.', 2),
 (u'finish.', 10),
 (u"Beauharnais'", 2),
 (u'Abramovna', 1),
 (u'rings', 8),
 (u'like...', 2),
 (u'BUT', 2),
 (u'known..."', 1),
 (u'd\xe9serte:', 1),
 (u'sais', 19),
 (u'distaste,', 1),
 (u'disprove', 1),
 (u'caught;', 1),
 (u'imprisonment,', 2),
 (u'sorts', 12),
 (u'muzzle,', 1),
 (u'Brumaire,', 1),
 (u"'pon", 1),
 (u'foe,', 9),
 (u'preserve', 17),
 (u'externally', 3),
 (u'noirceur.', 2),
 (u'loves', 53),
 (u'swindler!', 1),
 (u'rolled', 22),
 (u'pontoon', 1),
 (u'torrent,', 1),
 (u'distinctness.', 1),
 (u'sleepers.', 1),
 (u'red-nosed', 4),
 (u"s'unirent", 1),
 (u'alteration,', 2),
 (u'Evades', 1),
 (u'god-den.', 2),
 (u"'and", 2),
 (u'Countess,"', 6),
 (u'distrust--but', 1),
 (u'Here!', 1),
 (u'dishonor', 2),
 (u'enemy?"', 1),
 (u'you--', 3),
 (u'blasted', 10),
 (u'bras', 8),
 (u'foresight', 2),
 (u'Kuzmich--a', 1),
 (u'ring;', 1),
 (u'rum.', 1),
 (u'veer', 3),
 (u'Testament', 1),
 (u'BRUCE', 6),
 (u'coronation', 1),
 (u'disarms', 1),
 (u'wines,', 1),
 (u'Shore', 1),
 (u'against', 355),
 (u'Laura', 1),
 (u'sort;', 1),
 (u'Cassio', 103),
 (u'devotion', 23),
 (u'Convey', 1),
 (u'demander', 6),
 (u'votre-fille', 1),
 (u'adapted', 2),
 (u'Freemasons,', 1),
 (u'Whip', 4),
 (u'DESD\xc9MONA.--Au', 1),
 (u'hilly.', 1),
 (u'"Too', 3),
 (u'connection', 39),
 (u'vain', 14),
 (u'JAGO.--Soit.', 1),
 (u'addressing', 69),
 (u'Tushin.', 10),
 (u'offerings', 1),
 (u'"intended."', 1),
 (u'Shout', 1),
 (u'dehors.)', 1),
 (u"match'd,", 2),
 (u'wider', 3),
 (u'honneurs', 1),
 (u'humming', 7),
 (u'politesse.', 1),
 (u'durst,', 1),
 (u'Galactic', 4),
 (u'weeps,', 1),
 (u'deviations', 3),
 (u'Petersbourg,', 1),
 (u'asks', 16),
 (u'demesnes', 1),
 (u're\xe7u...', 1),
 (u'contemplate.', 1),
 (u'received--assumed', 1),
 (u'Early', 6),
 (u'devil...', 1),
 (u'flank?"', 2),
 (u'lingo."', 1),
 (u'obedient', 3),
 (u'mane,', 2),
 (u'appears:', 1),
 (u'duties', 23),
 (u'disseat', 1),
 (u'friends,"', 4),
 (u'ponds,', 1),
 (u'aies', 1),
 (u'man"', 1),
 (u'drugs', 4),
 (u'Dmitri,', 4),
 (u'afterwards!"', 1),
 (u'needed.', 7),
 (u'Traitors', 1),
 (u"Revisit'st", 1),
 (u'ramassez', 1),
 (u'seminarists,', 1),
 (u'dashing,', 2),
 (u"Mercutio's", 5),
 (u'patronizing', 4),
 (u'shambles_,', 1),
 (u'"Women,"', 1),
 (u'ordnance', 3),
 (u'Tickle', 1),
 (u'exception', 14),
 (u'pitied...."', 1),
 (u'bridges.', 2),
 (u'disgraced,', 1),
 (u'gods.', 2),
 (u'rejoinder.', 2),
 (u"d'elle.", 3),
 (u'men?"', 1),
 (u'revolutions,', 3),
 (u'Ce', 20),
 (u'aim', 96),
 (u'rests', 16),
 (u'misconstrued', 1),
 (u'divinities?', 1),
 (u'brotherly', 7),
 (u'advice,"', 1),
 (u'quick!', 3),
 (u'locality.', 2),
 (u'chilling', 2),
 (u'l\xe9g\xe8res.', 2),
 (u'faith.', 5),
 (u"'Fear", 2),
 (u'policeman,', 2),
 (u"learned's", 1),
 (u'losing', 43),
 (u'jutting', 1),
 (u'consent,', 7),
 (u'onze', 1),
 (u'us--doing', 1),
 (u"bailiff's", 1),
 (u'tugged', 5),
 (u'DESD\xc9MONA.--Quoi!', 2),
 (u'erred,', 1),
 (u'Pompey?', 1),
 (u'voyage.', 1),
 (u'top!', 1),
 (u'grown', 81),
 (u'quand', 49),
 (u'Slav', 1),
 (u'jolting', 4),
 (u'CVIII.', 1),
 (u'thunderbolts;', 1),
 (u'finger,', 8),
 (u'hated,', 5),
 (u'forehead', 32),
 (u'grapple', 1),
 (u'COLE,', 1),
 (u'"Granddad."', 2),
 (u'Petersburg.', 36),
 (u'naive', 22),
 (u'speaks:', 5),
 (u'lunes', 1),
 (u'qualities)', 1),
 (u'impotence.', 1),
 (u'Suspended', 1),
 (u'ill-usage', 1),
 (u'MACDUFF,', 7),
 (u'posed', 1),
 (u'men!...', 1),
 (u'reasons."', 1),
 (u'qualifications', 1),
 (u'evil....', 1),
 (u'aides-de-camp,', 3),
 (u'Men,', 2),
 (u'dimness', 1),
 (u'concourse', 2),
 (u'Carrying,', 1),
 (u'required.', 2),
 (u'vieux', 4),
 (u'workings', 2),
 (u'labor--idleness--was', 1),
 (u'"Madame', 1),
 (u'talking,', 14),
 (u'made.', 12),
 (u'peaked', 3),
 (u'shortly', 8),
 (u'recognizable.', 1),
 (u'thy', 741),
 (u'Gratiano._)', 1),
 (u'because...', 11),
 (u'FRIGATE.', 1),
 (u'bottle;', 1),
 (u'"Drain', 1),
 (u'conseils', 1),
 (u'Confound', 2),
 (u'skillful', 12),
 (u'stillness', 11),
 (u'grow:', 1),
 (u'qualities', 10),
 (u'cool,', 1),
 (u'mason,', 2),
 (u"unpress'd", 1),
 (u'meadows...', 1),
 (u'veteran', 2),
 (u'Reassured,', 1),
 (u'soldats,', 1),
 (u'recevoir.', 1),
 (u'song,', 8),
 (u'burn,', 4),
 ...]

Using the Cache

Pull a data sets into a cluster-wide in-memory cache.

Very useful when using iterative algorithms.


In [15]:
textFile.cache()


Out[15]:
data/books-eng/hamlet.txt MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

Find words that start with an 'H' or 'h'


In [68]:
words = textFile.flatMap(lambda line: line.split())
words_subset = words.filter(lambda x: x[0] in ['H','h'])
counts = words_subset.map(lambda w: (w,1)).reduceByKey(lambda a,b: a+b)
counts.collect()


Out[68]:
[(u"host's", 2),
 (u'humid', 1),
 (u'huge,', 8),
 (u'hillock.', 1),
 (u'hurrah!', 4),
 (u'hurry?"', 1),
 (u'Had', 59),
 (u'habilet\xe9;', 1),
 (u'hibernation', 1),
 (u'heirs', 1),
 (u'harness,', 5),
 (u'historians--those', 1),
 (u'health..."', 1),
 (u'hear,"', 1),
 (u'honesty,', 3),
 (u'historians--the', 1),
 (u'himself.)', 1),
 (u'hands,', 71),
 (u'heels;', 1),
 (u'hopeful', 3),
 (u'her,"', 12),
 (u'haste;', 4),
 (u'haquen\xe9es', 1),
 (u'hampered', 2),
 (u"ha't", 1),
 (u'highroad--polished', 1),
 (u'host;', 1),
 (u'HUSTON,', 2),
 (u'hope', 101),
 (u'happened,', 17),
 (u'Ha,', 14),
 (u'HANSON,', 1),
 (u'Helena.', 1),
 (u'history.', 15),
 (u'HUM.', 1),
 (u'herbs,', 3),
 (u'home?"', 3),
 (u'hut--and', 1),
 (u'hedge', 2),
 (u'him--in', 3),
 (u'http://pglaf.org/fundraising.', 2),
 (u'howling.', 2),
 (u'hastes', 1),
 (u'hugged', 5),
 (u'houses,', 21),
 (u'horizons', 1),
 (u'hosts', 1),
 (u'hate;', 2),
 (u"hack'd", 1),
 (u'harness...."', 1),
 (u'hems.', 1),
 (u'husband', 97),
 (u'hour!', 2),
 (u'Hmph.', 1),
 (u'hounds:', 1),
 (u'hangings.', 1),
 (u'hovers.', 1),
 (u'handing', 13),
 (u"headquarters.'", 1),
 (u'Him!...', 1),
 (u'hors', 5),
 (u'hive', 13),
 (u'has:', 1),
 (u'house:', 6),
 (u'Hardly', 12),
 (u'hopelessly', 1),
 (u'holds,', 2),
 (u'haste-post-haste', 1),
 (u'him--felt', 1),
 (u'hide', 60),
 (u'her--and', 3),
 (u'hooked', 4),
 (u'heiresses', 4),
 (u'horrible.', 1),
 (u'health', 48),
 (u'horrors;', 1),
 (u'honours.', 1),
 (u'HOME,', 1),
 (u'hard!"', 1),
 (u'hero', 17),
 (u'headquarters,', 3),
 (u'honest:', 1),
 (u'him--who', 1),
 (u'him--especially', 1),
 (u'hairs', 9),
 (u'Harder', 1),
 (u'hobbledehoy', 1),
 (u'hunter!', 2),
 (u'hectic', 1),
 (u"hadn't", 8),
 (u'Here!', 1),
 (u'happiness!', 5),
 (u'Hyperion', 1),
 (u'happening.', 4),
 (u'horses,"', 2),
 (u'healths', 1),
 (u'Hohenlohe,', 1),
 (u'himself,', 163),
 (u'hinting', 1),
 (u'hearer,', 1),
 (u'homme!"', 1),
 (u'hilts,', 1),
 (u'Hessian', 4),
 (u'Human', 2),
 (u'Hecuba!', 1),
 (u'households', 2),
 (u'him!"', 31),
 (u"honor--that's", 1),
 (u'hommes.--Si', 1),
 (u'History,', 1),
 (u'head]', 1),
 (u'H\xc9RAUT.', 1),
 (u'Hurry,', 2),
 (u'housewife', 4),
 (u'humming', 7),
 (u'happiness--floated', 1),
 (u'hide?', 1),
 (u'How,', 4),
 (u'hitting', 10),
 (u'Happily', 2),
 (u'hospital', 18),
 (u'ha-ve', 1),
 (u'hangs', 19),
 (u'humours!', 1),
 (u'headdress', 2),
 (u'higher,', 5),
 (u'hard,', 15),
 (u'high-pitched', 5),
 (u'happiness,"', 1),
 (u'holiday?"', 1),
 (u'hushed.', 1),
 (u'hut', 35),
 (u'husband."', 1),
 (u'holder', 2),
 (u'HANGAR.', 1),
 (u'Hollabrunn,', 1),
 (u'Hmm.', 4),
 (u'Han!', 6),
 (u"hare's", 3),
 (u'head;', 7),
 (u'holiday.', 1),
 (u'highly,', 3),
 (u'horsecloths,', 2),
 (u'Honneur', 2),
 (u"haven't", 33),
 (u'he,', 182),
 (u'heart-ache', 1),
 (u'haters.', 1),
 (u'hadst,', 1),
 (u'honneurs', 1),
 (u'Horatio:', 2),
 (u'horseflies', 1),
 (u'hull,', 3),
 (u'hot."', 2),
 (u'heavier.', 1),
 (u'hearts.', 5),
 (u'Holding', 7),
 (u'handle', 9),
 (u"huntsman's", 2),
 (u'happiness--ever', 1),
 (u'honey-and-nut', 1),
 (u'hilly.', 1),
 (u'him--evidently', 1),
 (u'her--might', 1),
 (u'Hurrah', 3),
 (u'Honneur;', 1),
 (u'Hoch', 1),
 (u'hated,', 5),
 (u'half-witted', 1),
 (u'hosts!', 1),
 (u'her--please', 1),
 (u"hunt's-up", 1),
 (u'HENSON,', 1),
 (u'heures', 5),
 (u'HORATIO,', 7),
 (u'heads."', 1),
 (u'he!"', 1),
 (u'hideuse', 1),
 (u'himself),', 2),
 (u'horsecloth,', 1),
 (u'Hello,', 3),
 (u'hunts,', 3),
 (u'hereupon', 1),
 (u'hear!', 2),
 (u'Hey...', 1),
 (u'Hung', 1),
 (u"hoof's", 1),
 (u'hearer:', 1),
 (u'Heights.', 2),
 (u'hommes.', 1),
 (u'housewifery,', 1),
 (u'heir;', 2),
 (u'holes', 4),
 (u'happiness--the', 1),
 (u'him--he', 7),
 (u'handrails.', 1),
 (u'heure;', 1),
 (u'habitera', 1),
 (u'hates', 2),
 (u'habit.', 3),
 (u'hast:', 1),
 (u'help', 188),
 (u'heavily', 26),
 (u'highmost', 2),
 (u'Hmmm?', 1),
 (u'history--civil', 1),
 (u'Han...', 2),
 (u'hath', 297),
 (u'heures,', 1),
 (u'Hills--even', 1),
 (u'hits.', 1),
 (u'Has', 9),
 (u'householders', 1),
 (u'hardhearted', 2),
 (u'happiness.', 28),
 (u'His', 422),
 (u'Hadst', 4),
 (u'here...certainly', 1),
 (u'haste:', 1),
 (u'honesty?', 2),
 (u'hell!', 4),
 (u'horrible."', 1),
 (u'helps', 11),
 (u"hush'd", 1),
 (u'HOOTKINS', 1),
 (u'host:', 1),
 (u'he.--Villain,', 1),
 (u'Hetzelsdorf', 1),
 (u'hautboys', 1),
 (u'halt', 13),
 (u'hm?', 1),
 (u'harms.', 1),
 (u'heathenish', 1),
 (u'here...', 5),
 (u'himself--not', 1),
 (u'hat.', 4),
 (u'horreurs,', 1),
 (u'Heights', 3),
 (u'Hamlet!', 6),
 (u'homestead', 8),
 (u'happen!"', 2),
 (u'hales,', 1),
 (u'heightened', 4),
 (u'had.', 5),
 (u'homme!', 1),
 (u'hangs,', 1),
 (u"ha'", 6),
 (u'helpless,', 1),
 (u'hurrah!"', 3),
 (u'homme?', 3),
 (u'heading', 15),
 (u'hop,', 1),
 (u'husband--God', 1),
 (u'habit', 32),
 (u'Hark,', 6),
 (u'Hollandais', 2),
 (u'headquarters--because,', 1),
 (u'HODENFIELD,', 1),
 (u'honest-hearted', 1),
 (u'harder,', 3),
 (u'has;', 2),
 (u'house;', 9),
 (u"Hamlet's", 8),
 (u'helped?"', 1),
 (u'Had,', 1),
 (u'harmlessly.', 1),
 (u'humiliate', 2),
 (u'holder,', 2),
 (u'harvested', 1),
 (u'healthy,', 3),
 (u'Ho!', 5),
 (u'hard."', 2),
 (u'hussars?"', 2),
 (u'Hello?', 1),
 (u'hommes!--Dis-moi,', 1),
 (u'hollering', 2),
 (u'hearted,', 1),
 (u'houses?', 2),
 (u'HOLMAN', 1),
 (u'hardship', 1),
 (u'honest;', 2),
 (u'hate:', 1),
 (u"Hasn't", 2),
 (u'hurrying,', 1),
 (u'HAMLET', 367),
 (u'houses', 33),
 (u'herring:', 1),
 (u'horseflesh', 4),
 (u'Hot.', 1),
 (u'hundreds', 28),
 (u'Honor', 5),
 (u'handiwork.', 3),
 (u'horses.', 27),
 (u'hilts;', 1),
 (u'heave,', 1),
 (u'her--considerate,', 1),
 (u'hatches', 1),
 (u'happening!', 1),
 (u'hot!"', 2),
 (u"have't!", 1),
 (u'howbeit', 2),
 (u'humour.', 1),
 (u'hunger', 12),
 (u'happiness...."', 1),
 (u'have?"', 3),
 (u'honor;', 1),
 (u'Huge', 2),
 (u'honest;--so', 1),
 (u'Heaven."', 1),
 (u'head,', 117),
 (u'hollow', 27),
 (u'honneur.', 1),
 (u'him--we', 1),
 (u'honor)', 1),
 (u'husbanded?', 1),
 (u"highness'", 5),
 (u'hmmm?', 1),
 (u'home?', 4),
 (u'hollows', 5),
 (u'habiles', 1),
 (u"hear't", 1),
 (u'huh,', 1),
 (u'Henry', 1),
 (u'handkerchief,', 11),
 (u'hideous?"', 1),
 (u'hug', 3),
 (u'hunters.', 1),
 (u'horseback--raised', 1),
 (u'Huttesse', 1),
 (u'Hands', 1),
 (u'hoodman-blind?', 1),
 (u'hands?', 1),
 (u"heap'd", 3),
 (u'harmful--gave', 1),
 (u'hard;', 3),
 (u'hundred,"', 1),
 (u'herein;', 1),
 (u'handsome', 90),
 (u'holiday', 6),
 (u'HUMAN:', 4),
 (u'he?', 25),
 (u'head:', 7),
 (u'humanity', 27),
 (u'hazard.', 2),
 (u'hasard', 1),
 (u'hot,', 5),
 (u'history', 106),
 (u'heart', 229),
 (u'honn\xeate?', 2),
 (u'hitting,', 1),
 (u'heads.', 9),
 (u'housemaid.', 1),
 (u'howling', 4),
 (u'hand,"', 3),
 (u'high,', 22),
 (u'Horatio;', 1),
 (u'highroad.', 4),
 (u'hostess.', 8),
 (u'hairy,', 2),
 (u'her--that', 2),
 (u'hark,', 1),
 (u'heart!', 5),
 (u'honors', 9),
 (u'historians.', 1),
 (u'how!', 1),
 (u'health?"', 2),
 (u"heart's", 14),
 (u'heritage', 1),
 (u'horse?"', 1),
 (u'howl.', 4),
 (u'herself,', 58),
 (u'hallway', 13),
 (u'hands--those', 1),
 (u"H\xc9RAUT.--C'est", 1),
 (u'heel,', 2),
 (u'holy,', 1),
 (u'him--', 1),
 (u'hostelry,', 1),
 (u'here"', 1),
 (u'holily;', 1),
 (u'happier.', 3),
 (u'Henker', 1),
 (u'history,"', 2),
 (u'honeying', 1),
 (u'Herein', 2),
 (u"happ'd.", 1),
 (u'happiest,', 1),
 (u'hollow-cheeked', 1),
 (u'himself--I', 1),
 (u'humeurs.', 1),
 (u'heaps', 1),
 (u'Hide', 2),
 (u'Heart,', 2),
 (u'hero--but', 1),
 (u'hence?', 2),
 (u'Hercules,', 2),
 (u'hers?', 1),
 (u"hiss'd", 1),
 (u'HOLT,', 2),
 (u'Hmmm.', 1),
 (u'horseman,', 1),
 (u'hand:', 18),
 (u'Hero', 1),
 (u'heights,', 1),
 (u'handkerchiefs;', 1),
 (u'hereabouts', 1),
 (u'hither;', 2),
 (u'herd,', 1),
 (u'Hippolyte.', 2),
 (u'harnessing', 1),
 (u'hap', 4),
 (u'Hamburg,', 1),
 (u'hail!', 3),
 (u'harp,', 1),
 (u'Harvest', 1),
 (u'him...', 13),
 (u'happened.', 17),
 (u'horn,', 3),
 (u'haut', 1),
 (u'hunter', 13),
 (u'Hamlet,', 25),
 (u'headquarters?"', 1),
 (u'Hollow.', 1),
 (u'hurting', 3),
 (u'here."', 9),
 (u'hotshot', 1),
 (u'hat!', 1),
 (u'hunted', 4),
 (u'holiday:', 1),
 (u'heeded.', 1),
 (u'hope;', 1),
 (u'howling,', 1),
 (u'had!', 2),
 (u'houses--it', 1),
 (u'HECATE', 4),
 (u'honourably.', 1),
 (u'Hurry.', 1),
 (u'halfway.', 3),
 (u'herald', 5),
 (u'hereafter:', 1),
 (u'homing', 1),
 (u'hol\xe0,', 2),
 (u'horseflesh,', 1),
 (u'halfworld', 1),
 (u"hate'", 3),
 (u'hems,', 1),
 (u'hour....', 1),
 (u'harder', 8),
 (u'hare', 18),
 (u'house.', 82),
 (u'hush', 3),
 (u'Hart,', 2),
 (u'horned', 2),
 (u'heretic,', 1),
 (u'hives;', 1),
 (u'happening--Kutuzov', 1),
 (u'happens', 44),
 (u'heartily;', 1),
 (u'horn--played', 1),
 (u'home...', 1),
 (u"hair'?", 1),
 (u'here,"', 30),
 (u'hydraulic', 3),
 (u'h\xe9riss\xe9es', 1),
 (u'house...', 3),
 (u'Hoth,', 1),
 (u'Helene--having', 1),
 (u'honor!"', 9),
 (u'hinder', 12),
 (u'happen,"', 4),
 (u'huntsmen', 10),
 (u'harsh.', 2),
 (u'horreur', 1),
 (u"Helena.'", 1),
 (u'hold!"', 1),
 (u'historians', 95),
 (u'Hey?"', 1),
 (u'Hit', 1),
 (u'himself!...', 1),
 (u'he.', 118),
 (u'half-severed', 1),
 (u'halted', 14),
 (u'happening,', 6),
 (u'Hark', 6),
 (u'Hungarian', 2),
 (u'him...I', 1),
 (u'his,', 13),
 (u'help:', 2),
 (u'himself.', 150),
 (u'Hail', 2),
 (u'H.', 4),
 (u'heel', 8),
 (u'Harness,', 1),
 (u'here!', 31),
 (u'Happiness', 1),
 (u'hostile,', 2),
 (u'heedful', 1),
 (u'have;', 4),
 (u'horse?', 1),
 (u'Henceforth', 2),
 (u"How'd", 1),
 (u'hurdle', 1),
 (u'Hist!', 1),
 (u'Hail!', 3),
 (u'highly.', 2),
 (u"he'd", 4),
 (u'Hotel.', 1),
 (u'H-a-a-a...', 1),
 (u'her', 4543),
 (u'hit--a', 1),
 (u'hazards', 1),
 (u'heroism', 7),
 (u'hothouses.', 1),
 (u'handkerchief!', 4),
 (u'Hoist', 1),
 (u'Hofkriegsrath', 6),
 (u'HAGANS', 1),
 (u'humour?', 1),
 (u'hardly', 81),
 (u'husband)', 1),
 (u'headache,', 1),
 (u'head', 515),
 (u'hard.', 2),
 (u"hawk'd", 1),
 (u'http://pglaf.org', 4),
 (u'hallooing--and', 1),
 (u'Hence-banished', 1),
 (u"hussar's", 4),
 (u'HARRISON', 1),
 (u"hussars'", 1),
 (u'her."', 21),
 (u'hole', 13),
 (u'herself', 239),
 (u'hands.', 60),
 (u'harm."', 1),
 (u'hesitant', 1),
 (u'hypocrites;', 1),
 (u'home.', 38),
 (u'harbour?', 1),
 (u'horses;', 6),
 (u'hovering', 1),
 (u'heiress', 7),
 (u'h\xe9ritier', 1),
 (u'horses)', 1),
 (u'hug.', 5),
 (u'her?--The', 1),
 (u'heard:', 2),
 (u'hearts,', 11),
 (u'hours?', 3),
 (u'her...immediately!', 1),
 (u'household', 32),
 (u'ha-ha-ha.', 1),
 (u'hermit.', 1),
 (u'horrors', 10),
 (u'husband;', 3),
 (u'honn\xeate.', 2),
 (u'hides', 5),
 (u'halt:', 1),
 (u'hautes', 1),
 (u'hoar', 4),
 (u'human-cyborg', 2),
 (u'Hope', 1),
 (u'horsecloths.', 1),
 (u'himself--who', 3),
 (u'HARTNEY', 1),
 (u'him?...', 1),
 (u'high-coloured.', 1),
 (u'hologram.', 1),
 (u'hallway.', 11),
 (u'hatch,', 1),
 (u'horror', 31),
 (u'heraldry,', 1),
 (u'handled.', 1),
 (u"horse's", 15),
 (u'higher.', 2),
 (u'had."', 1),
 (u'husbands', 12),
 (u'hitched', 1),
 (u'Haste', 2),
 (u'horreurs', 1),
 (u"husband's", 25),
 (u'happier', 10),
 (u'happier?', 1),
 (u"head's", 3),
 (u'Hutt.', 5),
 (u'house-affairs', 1),
 (u'historic,', 1),
 (u'humanity.', 7),
 (u'him--at', 1),
 (u'here!..."', 1),
 (u'Holla!', 2),
 (u'humor,', 6),
 (u'housemaid...', 1),
 (u'hardi', 2),
 (u'high-battled', 1),
 (u'hands', 208),
 (u'hits:', 1),
 (u'handicraft,', 1),
 (u'hurlements', 1),
 (u'happen),', 1),
 (u'hastens', 2),
 (u'heiress,', 1),
 (u'hits,', 1),
 (u'hence,', 13),
 (u'happen--despite', 1),
 (u'hand;', 18),
 (u'heave.', 2),
 (u'haughty', 2),
 (u'homage', 5),
 (u'hums', 1),
 (u'headdress,', 1),
 (u"Honour'd,", 1),
 (u'he?!', 1),
 (u'histories', 15),
 (u'heureuse.', 1),
 (u"hand'", 1),
 (u'hither:', 6),
 (u'Hic', 1),
 (u'households,', 2),
 (u'hardened', 2),
 (u'History', 11),
 (u"house's", 1),
 (u'honte', 3),
 (u'Hamlet?', 4),
 (u'hawkers,', 3),
 (u'hostess,', 4),
 (u'horologe', 1),
 (u'HORATIO', 114),
 (u'houses!', 4),
 (u"husband,'fall'st", 1),
 (u'honorable', 13),
 (u'Historic', 1),
 (u'Hills;', 1),
 (u'him', 3435),
 (u'hovel', 1),
 (u'himself?"...', 1),
 (u'harbinger', 1),
 (u'hereabout:', 2),
 (u'heaved', 7),
 (u'human.', 1),
 (u'hair?"', 1),
 (u'hers,', 11),
 (u'Hyperdrive.', 1),
 (u"Highness'", 2),
 (u'hereafter;', 2),
 (u'hero,', 8),
 (u'Here!"', 1),
 (u'hopelessness:', 1),
 (u'help;', 2),
 (u'horse),', 1),
 (u'here"--and', 1),
 (u'hussar!"', 1),
 (u'humbug,', 1),
 (u"hail'd", 1),
 (u'habitually', 1),
 (u'hard', 93),
 (u'hair-breadth', 1),
 (u'honneur,', 2),
 (u'healed', 2),
 (u'hut,', 14),
 (u'Hold;', 1),
 (u'hanged', 7),
 (u'hindrance', 7),
 (u'heels!"', 1),
 (u'hurries', 12),
 (u'Hyrcan', 1),
 (u'humiliation.', 1),
 (u'hourly', 2),
 (u"heaven's", 20),
 (u'holder.', 2),
 (u'hussar,', 7),
 (u'hinds?', 1),
 (u'headquarters,"', 1),
 (u'hilt', 3),
 (u'hearse,', 1),
 (u'honour!', 1),
 (u'howsoever', 1),
 (u'hell-hound,', 1),
 (u'hayfield.', 1),
 (u'harvesting,"', 1),
 (u"ha't,", 1),
 (u'Happy', 4),
 (u'half-closed', 4),
 (u'hobby', 1),
 (u'Heaven', 20),
 (u'heureux', 2),
 (u'he..."', 1),
 (u"honor'", 1),
 (u'Hokey', 1),
 (u'Hover', 1),
 (u'healing', 2),
 (u'heroic.', 1),
 (u'Hot,', 1),
 (u'hunters,', 1),
 (u'hostile', 18),
 (u'hunger-stricken,', 1),
 (u'hearsed', 1),
 (u'humbugged', 2),
 (u'hovers', 3),
 (u'hid,', 2),
 (u'herself.', 41),
 (u'happiness,', 12),
 (u'heirs?"', 1),
 (u'hooted', 1),
 (u'here-approach,', 1),
 (u'handleless', 1),
 (u'Hence,', 5),
 (u'Hungarian?"', 1),
 (u'hotly', 3),
 (u'humblement', 4),
 (u"How's", 2),
 (u'heard;', 1),
 (u'hustles', 1),
 (u'have:', 4),
 (u'Hussars,', 2),
 (u'hitherto', 12),
 (u'hope:', 2),
 (u'How!', 7),
 (u"hawk's", 1),
 (u'heated', 7),
 (u'HOLLY', 1),
 (u'haine,', 2),
 (u'hawk', 2),
 (u'hunger,"', 1),
 (u'ha\xefsse', 1),
 (u'humour,', 4),
 (u'happened,"', 5),
 (u'honor.\'"', 1),
 (u'Hussar,', 1),
 (u'HARRELL', 1),
 (u'hundred', 134),
 (u'halt;', 1),
 (u'happy', 176),
 (u'hesitations,', 1),
 (u"hatch'd,", 1),
 (u'himself,"', 4),
 (u'heroics.', 1),
 (u'happiness!"', 3),
 (u'Ha-ha-ha!', 1),
 (u'hands!', 1),
 (u'headset)', 21),
 (u'HOLOGRAM', 1),
 (u'holographic', 5),
 (u'hold', 171),
 (u'haven', 2),
 (u'he', 8766),
 (u'Hurry', 3),
 (u"hell's", 1),
 (u'How', 375),
 (u'hideously', 1),
 (u'horses:', 1),
 (u'honor?..."', 1),
 (u'hazard,', 1),
 (u'hillside.', 2),
 (u'Helene', 88),
 (u"here--that's", 1),
 (u'headache', 3),
 (u'Himself', 3),
 (u'helm,', 1),
 (u'hair,', 36),
 (u'hideux', 1),
 (u'housekeeper)', 1),
 (u'history--whatever', 1),
 (u'hesitates', 1),
 (u'high-minded', 1),
 (u'HERALD,', 1),
 (u'Holds', 4),
 (u'hear?"', 5),
 (u'heard', 480),
 (u'hours,', 15),
 (u'high.', 3),
 (u'husband:', 4),
 (u'honey,', 3),
 (u'HERMAN', 2),
 (u'historians:', 1),
 (u'hesitate', 1),
 (u'horses!"', 1),
 (u"Hendrikhovna's", 5),
 (u'harrows', 1),
 (u'hardness,', 1),
 (u'hark.', 1),
 (u'house,--But,', 1),
 (u'heels?"', 1),
 (u'haps,', 1),
 (u'Hospital,', 1),
 (u'historic', 34),
 (u'historians,', 13),
 (u'health...', 1),
 (u'hem!_', 1),
 (u'hither?"', 1),
 (u"He'll", 15),
 (u'hurry?"--Nicholas', 1),
 (u'howls.', 4),
 (u'hospital.', 5),
 (u"hand's", 2),
 (u'hadst', 13),
 (u'harbingers', 2),
 (u'habitual,', 1),
 (u'hazel', 3),
 (u'hearing;', 1),
 (u'highroad,', 3),
 (u'him--Rostov.', 1),
 (u'humanity."', 1),
 (u"Hutt's", 1),
 (u'Hum!', 4),
 (u'hare.', 7),
 (u'haltingly)', 1),
 (u'hoarsely', 3),
 (u'happies', 1),
 (u'honestly.', 1),
 (u'honorably,"', 1),
 (u'heirs.', 1),
 (u'hard--', 1),
 (u'honor', 80),
 (u'hearty', 5),
 (u'him--Pierre--depriving', 1),
 (u'handgun.', 1),
 (u'howl,', 5),
 (u'helplessly', 6),
 (u'harm!_', 1),
 (u'here...are', 1),
 (u'Horse', 21),
 (u'hierarchy', 2),
 (u'haute', 5),
 (u'hardtack', 1),
 (u'HARMAN,', 1),
 (u'hence!', 4),
 (u'HUME,', 2),
 (u'humaine', 2),
 (u'headset.', 1),
 (u'herself,"', 1),
 (u'happiness', 77),
 (u'hand--he', 1),
 (u'haven.', 1),
 (u'HUT', 1),
 (u'heavenly;', 1),
 (u'honoring', 1),
 (u'had', 5563),
 (u'hyperspace', 6),
 (u'hey-day', 1),
 (u'h\xe9las!', 4),
 (u'Hid', 1),
 (u'heights.', 1),
 (u'herdsman', 1),
 (u'harsh,', 2),
 (u'him),', 2),
 (u'homepage', 1),
 (u"hideux.--D'ailleurs", 1),
 (u'house.]', 2),
 (u'hair!..."', 1),
 (u'hundred?"', 1),
 (u"Han's", 28),
 (u'hanged.', 1),
 (u'Hamlet.', 11),
 (u'hugging', 3),
 (u'huddles', 2),
 (u'harp.', 2),
 (u'Home"', 2),
 (u'ha,', 17),
 (u'hiccough', 1),
 (u'hey!"', 1),
 (u'Hurrah!', 3),
 (u'honor?...', 1),
 (u'hate!', 4),
 (u'hilding!', 1),
 (u'happily..."', 1),
 (u'hanging,', 1),
 (u'Hate', 2),
 (u'heavy', 106),
 (u'hairy', 9),
 (u'HAROLD', 1),
 (u'hales', 1),
 (u'hurt,', 6),
 (u'humans', 1),
 (u'house,', 77),
 (u'her;', 33),
 (u'H\xe9las!', 2),
 (u'heralds', 3),
 (u'Hurrah!..."', 1),
 (u'humane', 2),
 (u'heave', 4),
 (u'hobby-horse:', 1),
 (u'hesitation', 11),
 (u'hall--"but', 1),
 (u'heat-oppressed', 1),
 (u'HOWARTH', 1),
 (u'hit;', 1),
 (u'hearted;', 1),
 (u'HAMLET,', 4),
 (u'haunt,', 2),
 (u'he?"', 11),
 (u'hirelings', 1),
 (u'honors.', 1),
 (u'Hoth.', 1),
 (u'him"', 4),
 (u'hear?', 8),
 (u'habitude?', 1),
 (u'Hecate!', 1),
 (u'headphones.', 1),
 (u'heathen?', 1),
 (u'Hostilities', 1),
 (u'hey,', 1),
 (u'harmony!"', 1),
 (u'horseflesh!\'"', 1),
 (u'hoarse', 17),
 (u'harshly', 2),
 (u'hurts,', 2),
 (u'handwriting', 1),
 (u'hait', 1),
 (u'here;', 16),
 (u'hits', 35),
 (u'his--has', 1),
 (u'Huttese.', 2),
 (u'his.', 13),
 (u'Hence!', 3),
 (u'heavy;', 1),
 (u'her--for', 1),
 (u'head)', 4),
 (u'hallway,', 8),
 (u'herself!', 4),
 (u'happiness,\'"', 1),
 (u'hams:', 1),
 (u"hero's", 1),
 (u'hummed', 3),
 (u'humour', 8),
 (u'hereditary,', 1),
 (u'habitual', 17),
 (u'her)', 2),
 (u'hard-boiled', 1),
 (u'hangers,', 1),
 (u'high."', 1),
 (u"HAN'S", 1),
 (u'h\xe2ter', 1),
 (u'Herculean', 2),
 (u'horizontal', 2),
 (u'harm,"', 2),
 (u'hangar,', 4),
 (u'highest,', 1),
 (u'horse', 162),
 (u'history),', 1),
 (u'horsemen.', 2),
 (u'hastening', 5),
 (u'Helene.', 17),
 (u'harder--ever', 1),
 (u'hellish', 3),
 (u"huntsmen's", 3),
 (u'hawking', 1),
 (u'hist!', 1),
 (u'honesty:', 1),
 (u'Horizontal', 1),
 (u'Herself', 2),
 (u'home,', 41),
 (u'held:', 1),
 (u'hypothetical', 1),
 (u'http://www.pglaf.org.', 2),
 (u'Himself,', 2),
 (u'histories,', 1),
 (u'huh?', 7),
 (u'hoarfrost.', 2),
 (u'hiding.', 2),
 (u'homely', 3),
 (u'heat', 30),
 (u'hoist', 2),
 (u'Hippolyte,', 8),
 (u'hateful', 8),
 (u'hopes', 27),
 (u'happened!"', 1),
 (u'hindrance,', 1),
 (u"here's", 30),
 (u'heels--we', 1),
 (u'half,', 4),
 (u'happiness."', 3),
 (u'hay.', 2),
 (u'her....', 1),
 (u'he:', 5),
 (u"HAMLET.'", 2),
 (u'Helen.', 1),
 (u'hospitality,', 3),
 (u"hate's", 2),
 (u'harm--and', 1),
 (u'Husband?', 1),
 (u'harshly,"', 1),
 (u'Hood', 1),
 (u'honn\xeate,', 2),
 (u'huh!', 2),
 (u'high', 142),
 (u'http://www.gutenberg.org/1/8/1/7/18179/', 1),
 (u'him--also', 1),
 (u'have', 3081),
 (u'heaven,', 49),
 (u'hymns', 3),
 (u'Hermit,', 1),
 (u'he:--O', 1),
 (u'hell:', 2),
 (u'horns,', 1),
 (u'half-smile', 1),
 (u'h\xe2tez-vous,', 1),
 (u'horse,"', 4),
 (u'hat--a', 1),
 (u"hour's", 5),
 ...]

Cleanup using a regular expression

  • remove all non-alphanumeric characters

In [17]:
import re

# Compile a regular expression that matches non-alphanumerics
pattern = re.compile('[\W_]+', re.UNICODE)

# Replace all non-alphanumerics with a space, then split into words
words = textFile.map(lambda line: pattern.sub(' ',line)).flatMap(lambda line: line.split())

#first_letters = set(['H','h','Q','q','s','S'])
first_letters = set(['A','a'])
# Count words that start with H
words_subset = words.filter(lambda x: x[0] in first_letters)
counts = words_subset.map(lambda w: (w,1)).reduceByKey(lambda a,b: a+b)
res = counts.collect()
for r,c in res:
    print r,c


all 110
antic 1
armour 3
actions 2
admittance 1
aspect 1
amiss 2
appetite 1
accurst 1
Awake 1
advanced 1
admirable 1
Away 2
abused 1
assay 4
advantage 1
account 1
answer 14
Anon 3
accuse 1
animals 1
affections 1
Affection 1
Art 1
arrows 2
abatements 1
audience 5
argal 3
arrow 1
Attends 1
Aroused 1
antique 2
achievements 1
art 16
afflict 1
asleep 2
aloof 2
appurtenance 1
Attendants 9
attend 1
armed 2
away 23
access 1
acting 1
above 4
assurance 2
ass 4
accord 1
assured 2
ask 2
Against 4
abate 1
Addicted 1
along 3
about 15
abstinence 1
accepts 1
assault 1
against 19
aye 1
anon 3
Advancing 1
action 10
approve 2
altitude 1
amber 1
acted 1
apt 2
arras 5
alone 10
alleys 1
another 11
Answer 1
array 1
afoot 1
abominably 1
acquaint 1
addition 3
angry 1
accent 2
audit 1
avouch 1
adoption 1
angels 2
a 469
A 81
Assume 1
awake 1
ashamed 1
And 264
admiration 3
aught 10
apprehension 2
arrant 2
appliance 1
aim 2
Almost 1
assays 1
amble 1
age 9
affection 3
and 706
angel 4
almost 9
abstract 1
am 51
affrighted 1
anger 1
as 154
ay 3
appointment 1
awhile 12
Angels 1
apoplex 1
angle 1
agreeing 1
able 1
admit 1
actor 3
actors 2
ambitious 3
afflicts 1
advise 1
arithmetic 1
ancle 1
About 3
allow 3
accidental 1
applaud 1
advice 3
apparel 1
Admit 1
arrived 1
attent 1
authorities 1
awe 2
ago 1
All 12
assume 3
afar 1
alarm 2
Am 2
As 75
aunt 1
Another 6
Ay 35
Armed 1
avoid 3
amazed 1
affair 4
After 2
ambassador 1
ambition 6
Acts 1
argues 1
arrests 1
aright 1
adders 1
anticipation 1
Ambassador 1
advancement 2
acres 1
amaze 1
Alas 9
attractive 1
augury 1
Affront 1
appal 1
assistant 2
adheres 1
auspicious 1
Adam 2
altogether 1
artery 1
adventurous 1
apart 2
articles 1
always 1
annexment 1
Appears 1
Are 8
arraign 1
Ambassadors 2
At 12
Arm 3
amazement 2
Aside 11
actively 1
are 123
awry 1
ambassadors 3
afeard 1
arm 4
ability 1
army 1
Adieu 4
appear 6
asking 2
across 1
attended 2
abridgement 1
amen 1
apparition 2
answerest 1
According 1
ancient 1
address 1
ache 2
article 2
appears 2
Among 1
accident 3
asunder 1
annual 1
according 1
Antiquity 1
assure 1
abuse 1
afternoon 1
act 14
asked 1
allowance 2
already 6
afterwards 1
Act 5
allegiance 1
abhorred 1
affliction 3
ape 2
Aeneas 1
arrest 1
arms 9
abroad 1
airs 1
ah 2
airy 1
astonish 1
aslant 1
acquire 1
alas 1
assail 1
aery 1
air 13
apiece 1
aid 1
artless 1
anchor 1
Alexander 5
attribute 1
argument 4
an 52
associates 1
at 75
affectation 1
any 17
absolute 2
again 32
afraid 1
author 2
adjoin 1
anoint 1
acts 1
aside 1
aptly 1
also 1
adulterate 1
aboard 3
Absent 1
adieu 3
assigns 2
affairs 1
amities 1
after 12
acquaintance 1
ardour 1
Abuses 1
Alack 3
ambiguous 1
Ah 4
An 7
accounted 1
absurd 2
axe 3
accidents 1

Counting Letter bigrams


In [18]:
import re

#textFile = sc.textFile("notes/data/books-eng/hamlet.txt")
textFile = sc.textFile("data/books-eng")

# Compile a regular expression that matches non-alphanumerics
pattern = re.compile(u'[\W0-9_]+', re.UNICODE)

# Replace all non-alphanumerics with a space, then split into words
words = textFile.map(lambda line: pattern.sub(' ',line)).flatMap(lambda line: line.split())
# Convert to lower case
words = words.map(lambda w: w.lower())

# Convert to a list of letters list('abc') = ['a','b','c']
letters = words.flatMap(lambda word: [pair[0]+pair[1] for pair in zip(list('_'+word),list(word+'_')) ] )
counts = letters.map(lambda w: (w,1)).reduceByKey(lambda a,b: a+b)
bigrams = counts.collect()

In [19]:
for r,c in sorted(bigrams,key=lambda x:x[1],reverse=True):
    print r,c


e_ 173007
_t 128896
th 106890
s_ 102616
he 101877
d_ 99475
_a 97158
t_ 92487
_s 70680
_h 67773
in 66235
an 64815
n_ 63820
er 63564
_w 56282
r_ 53687
re 51632
_i 49719
_o 48674
y_ 47659
nd 47369
o_ 43800
on 40854
ou 40280
en 40046
at 39581
ha 39353
_b 38727
_m 36324
_c 35647
hi 35341
to 34894
ed 33922
is 33107
ng 32960
_f 32956
es 32443
_d 31614
it 31281
or 31094
ar 30018
g_ 28707
_p 28563
_l 28281
as 28232
st 28109
te 27940
le 27823
f_ 26631
se 26373
nt 25097
a_ 24603
me 24098
h_ 23958
l_ 23453
_n 23156
ve 23023
de 22741
of 22605
ll 22459
ro 22383
_r 21805
ea 21744
ne 21493
al 21160
_e 20821
ti 19941
ho 19929
no 19729
co 19284
ri 19224
ce 18776
m_ 18059
be 17505
om 16780
el 16726
wh 16362
ut 16327
ch 16300
_g 16170
us 15654
ot 15603
ur 15515
ma 15342
wi 15264
wa 15242
sh 15099
ad 14911
ow 14882
lo 14702
li 14346
et 14133
ra 13998
si 13911
ee 13729
ss 13609
ai 13581
il 13499
so 13246
ta 13229
ie 13151
_y 13113
un 13042
i_ 13034
la 12952
im 12935
pe 12401
fo 12390
ly 12176
rs 12118
io 11992
nc 11800
w_ 11671
ic 11600
oo 11477
yo 11459
we 11425
ca 11390
sa 11272
id 11247
di 11239
mo 11017
ac 10786
ol 10729
pr 10559
tr 10317
em 10279
ke 10143
u_ 10091
ir 9822
ns 9769
ld 9755
do 9531
os 9357
ge 9287
ul 9201
gh 9156
ov 8964
_u 8948
k_ 8845
rt 8572
ni 8533
ay 8478
_v 7962
pa 7962
mi 7841
av 7793
na 7772
po 7752
ec 7708
fr 7596
bu 7534
am 7441
ig 7416
ry 7187
rd 7177
fi 7118
ev 7090
wo 7065
vi 6727
fe 6568
ht 6539
pi 6433
fa 6346
rr 6332
tt 6262
ts 6242
go 6233
su 6067
pl 6044
bo 5934
_k 5928
ag 5854
ct 5848
ap 5810
tu 5809
ei 5797
ia 5646
bl 5635
ba 5542
qu 5413
ey 5348
p_ 5299
if 5295
ep 5285
my 5275
sp 5248
od 5187
mp 5122
iv 5121
ru 5012
ab 4900
ck 4831
ki 4823
ex 4745
op 4713
dr 4704
ug 4669
ga 4574
ak 4570
ok 4554
gr 4535
_j 4453
da 4419
ue 4411
oi 4408
ew 4397
au 4336
up 4301
ef 4282
ty 4261
tl 4173
ff 4154
rm 4150
lu 4132
ci 4017
ye 4006
ds 3963
rn 3742
_q 3719
by 3658
gi 3641
br 3627
pp 3590
cr 3543
ui 3524
uc 3404
sc 3403
cl 3347
vo 3309
va 3205
nn 3200
ny 3185
wn 3176
v_ 3058
aw 3003
kn 3001
lf 2984
eg 2768
ft 2743
mu 2734
oc 2667
fu 2642
hr 2635
ls 2621
hu 2611
ms 2517
cu 2509
nk 2496
nl 2478
sk 2471
eo 2420
lt 2410
rc 2387
gu 2374
um 2364
hy 2357
ud 2344
mm 2337
pt 2303
pu 2250
af 2213
sm 2141
rg 2128
du 2104
fl 2038
bi 2034
gl 1999
ik 1979
rk 1949
ys 1936
eu 1929
ua 1925
sl 1884
dy 1875
ju 1870
mb 1859
oa 1839
ps 1830
tw 1819
uk 1815
ob 1807
rl 1802
je 1785
ks 1729
ib 1723
ip 1676
dd 1651
xp 1610
dl 1513
iu 1510
c_ 1495
nu 1486
gn 1485
ya 1471
gs 1441
cc 1433
sw 1410
lk 1376
nf 1361
yi 1358
ub 1352
oy 1257
ws 1226
tc 1220
rv 1203
og 1104
jo 1099
ze 1081
nv 1052
xt 1028
x_ 1026
eh 997
oe 987
ae 922
z_ 864
yt 863
ez 856
wr 829
iz 824
ph 822
rf 820
kh 818
ja 811
rp 809
xc 781
ml 772
dg 770
gg 759
bs 744
xi 718
à_ 714
zo 701
ku 695
eb 695
ka 680
sn 659
lm 658
uz 652
lv 640
ko 632
uf 632
rb 628
é_ 626
oh 624
sd 620
xe 606
lp 594
sy 584
_à 578
lw 554
dé 545
_é 537
vn 532
lr 530
lc 527
ah 519
sb 514
kl 499
az 497
ém 481
py 481
bb 471
cy 467
dn 456
ix 450
hm 445
xa 442
ek 439
_x 438
wd 433
b_ 425
dm 419
ii 418
dv 406
nq 405
wl 397
ux 397
rw 395
eq 391
sf 377
uo 368
uv 362
tm 361
bt 359
mn 350
kp 317
rh 315
hs 315
té 306
nb 305
sq 303
zi 296
cb 290
ré 283
dj 282
tf 275
ée 269
tn 262
vs 262
yb 249
ér 242
_z 242
gt 238
êt 238
vr 235
ym 234
oj 228
ét 225
yl 224
tz 222
hn 220
yp 218
vl 215
aj 214
bj 214
nj 208
ox 208
nm 207
ky 205
sé 202
nh 202
yr 201
vy 201
gy 198
hl 198
fs 196
j_ 190
mf 189
za 187
èr 178
ép 175
yc 174
mt 171
oz 169
xx 169
né 168
nw 161
éc 161
és 158
zh 154
np 152
yw 149
cq 146
ln 144
cs 136
zu 132
_ê 130
én 130
pc 127
fy 127
iq 125
cd 124
xv 121
là 118
nr 115
lq 114
ax 113
vé 112
yf 110
mê 110
lg 109
êm 108
kw 108
lé 107
df 106
él 105
pf 103
zz 102
lb 102
ej 102
kr 101
gm 101
dk 100
wb 98
hk 97
ès 97
hf 96
mé 96
nx 93
pé 93
gé 91
yn 90
aî 90
mr 90
rè 87
ît 85
xh 85
dw 85
iè 85
uh 83
tp 82
rz 82
sg 80
sv 79
vu 77
hé 76
bm 74
hd 72
hb 71
hc 71
wk 69
ié 67
gd 66
ù_ 65
où 65
wf 64
zm 63
âm 62
yu 61
cé 61
sr 61
hh 59
mc 59
rq 59
év 59
zl 59
dh 57
_â 55
dt 54
ég 54
bv 53
dq 53
oq 52
ao 52
uy 51
éd 51
wy 50
pm 49
ût 48
rê 47
kc 47
uj 47
aa 47
hw 47
pè 45
ôt 45
cè 43
ué 42
lh 41
yk 40
cx 40
âc 39
lè 38
nz 37
ço 37
iw 36
zy 35
ww 35
sû 34
nê 34
ûr 33
dè 32
km 32
kb 32
èn 32
pw 32
èl 31
yd 30
lx 30
wu 30
xy 29
fw 29
ji 28
bd 28
lz 27
ât 27
pk 27
rj 27
zd 27
dc 27
fb 27
kg 27
pç 26
aq 25
tb 25
yh 25
bh 25
râ 25
tô 25
èv 23
éa 23
dp 23
xl 22
éf 22
éj 21
hè 21
bé 21
pg 20
bw 20
mè 20
xu 20
_ô 20
wt 20
kf 20
oû 19
tê 19
gw 18
rô 18
db 18
jà 18
wc 18
zn 17
çu 17
pê 17
fâ 17
xo 17
tv 17
eû 17
tè 17
èm 17
îl 16
ôl 16
_î 16
eç 16
hâ 16
ô_ 15
fé 15
pb 14
èc 14
êc 13
èt 13
yé 13
tk 13
yz 13
bn 13
gè 12
în 12
mw 12
tg 12
kv 12
éb 11
yg 10
éq 10
pn 10
lâ 10
aï 10
td 10
cô 10
wm 9
èd 9
rç 9
cv 9
vè 9
éi 9
ih 9
oë 8
ët 8
vk 8
êl 7
xq 7
kt 7
fk 7
cw 7
hp 7
fû 7
vê 6
mk 6
yv 6
sè 6
sj 6
nç 6
kd 6
zè 5
âg 5
sz 5
fn 5
ça 5
zv 5
jr 5
cz 5
ïr 5
ûm 4
ij 4
ên 4
èg 4
fî 4
zk 4
lû 4
vô 4
hv 4
cm 3
fh 3
zs 3
ûl 3
pd 3
tâ 3
gf 3
rû 3
dû 3
bc 3
pû 3
uï 3
mv 3
âl 3
éu 3
êv 3
uu 3
ân 3
ï_ 3
uw 3
uê 3
pâ 3
èb 3
ïs 3
uq 3
vg 3
û_ 2
éé 2
vv 2
dz 2
wp 2
aç 2
bâ 2
oé 2
gê 2
_è 2
éo 2
ôn 2
q_ 2
hê 2
îm 2
fm 2
bî 2
tx 2
xé 2
gb 2
dâ 2
ïe 2
gâ 2
dî 2
oï 2
îc 2
oî 2
vt 2
gz 2
èq 2
fê 1
ôd 1
bf 1
ûc 1
rî 1
kk 1
vw 1
nû 1
ïl 1
qv 1
hg 1
oê 1
pô 1
fp 1
mg 1
câ 1
cg 1
nâ 1
wg 1
âb 1
iâ 1
cp 1
mô 1
mq 1
xm 1
ïv 1
iy 1
cê 1
éâ 1
hq 1
mâ 1
zw 1
fè 1
zb 1
cn 1
xs 1
ûn 1
bû 1
éh 1

In [129]:
bigrams[0][1]


Out[129]:
16694

In [20]:
%matplotlib inline
import matplotlib.pylab as plt

# Reduction table
my2ascii_table = {
    ord(u'â'):"a",
    ord(u'ä'):"e",
    ord(u"à"):"a",
    ord(u"æ"):"a",
    ord(u'ç'):"c",
    ord(u"é"):"e",
    ord(u"è"):"e",
    ord(u"ê"):"e",
    ord(u"ë"):"e",
    ord(u'ğ'):"g",
    ord(u'ı'):"i",
    ord(u"î"):"i",
    ord(u'ï'):"i",
    ord(u'œ'):"o",
    ord(u"ô"):"o",
    ord(u'ö'):"o",
    ord(u'ş'):"s",
    ord(u'ù'):"u",
    ord(u"û"):"u",
    ord(u'ü'):"u",
    ord(u'ß'):"s"
    }


def letter2idx(x):
    if x=='_':
        i = 0
    else:
        i = ord(x)-ord('a')+1
    
    if i<0 or i>26:
        i = ord(my2ascii_table[ord(x)])-ord('a')+1
        
    return i

T = np.zeros((27,27))
# Convert bigrams to a transition matrix
for pair in bigrams:
    c = pair[1]
    s = list(pair[0])
    j = letter2idx(s[0])
    i = letter2idx(s[1])
    T[i,j] += c

plt.figure(figsize=(8,8))

alphabet=[chr(i+ord('a')) for i in range(26) ]
alphabet.insert(0,'_')
M = len(alphabet)

plt.imshow(T/np.sum(T,axis=0), interpolation='nearest', vmin=0,cmap='gray_r')
plt.xticks(range(M), alphabet)
plt.xlabel('x(t)')
plt.yticks(range(M), alphabet)
plt.ylabel('x(t-1)')
ax = plt.gca()
ax.xaxis.tick_top()
#ax.set_title(f, va='bottom')
plt.xlabel('x(t)')

plt.show()


Monte Carlo with Spark


In [24]:
import numpy as np
import pyspark

def sample(p):
    x, y = 2*np.random.rand()-1, 2*np.random.rand()-1
    return 1 if x*x + y*y < 1 else 0

NUM_SAMPLES = 1000000

count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)

print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))


Pi is roughly 3.148096

Spark application:

  • a driver program that runs the user’s main function, that executes various parallel operations on a cluster.

Main abstraction:

  • resilient distributed dataset (RDD),
  • a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

RDD:

  • Created from a file in the Hadoop file system (or any other Hadoop-supported file system)
  • Automatically recover from node failures.

Cached:

  • Ask Spark to persist an RDD in memory

Shared variables:

  • broadcast variables: cache a value in memory on all nodes,
  • accumulators: counters and sums.

Spark’s Python API data formats (as of Nov 2016):

  • SparkContext.wholeTextFiles: read a directory containing of multiple small text files, return (filename, content) pairs.

  • RDD.saveAsPickleFile and SparkContext.pickleFile: save an RDD as pickled Python objects.

  • SequenceFile and Hadoop Input/Output Formats

Reading Json


In [21]:
from pyspark.sql import SparkSession

#    .config("spark.some.config.option", "some-value") \

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

df = spark.read.json("data/products.json")
# Displays the content of the DataFrame to stdout
df.show()


+-------+--------------------+
|   name|          properties|
+-------+--------------------+
|Product|[[Product identif...|
+-------+--------------------+


In [22]:
#df.printSchema()
df.select("properties").show()


+--------------------+
|          properties|
+--------------------+
|[[Product identif...|
+--------------------+

Reading Parquet


In [23]:
df = spark.read.parquet("file")


---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
<ipython-input-23-9de69b10c370> in <module>()
----> 1 df = spark.read.parquet("/Users/cemgil/Downloads/sahibinden_NativeAdFeaturesEdr/dt=20171028/000000_0")

/Users/cemgil/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/readwriter.pyc in parquet(self, *paths)
    272         [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
    273         """
--> 274         return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
    275 
    276     @ignore_unicode_prefix

/Users/cemgil/spark-2.0.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/Users/cemgil/spark-2.0.1-bin-hadoop2.7/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: u'Path does not exist: file:/Users/cemgil/Downloads/sahibinden_NativeAdFeaturesEdr/dt=20171028/000000_0;'

In [28]:
df.describe()


Out[28]:
DataFrame[summary: string, flakeid: string, deliveryid: string, adid: string, version: string, channel: string, bidpricekurus: string, dailybudgetkurus: string, matchedadcity: string, adimpressioncount: string, adclickcount: string, adctr: string, adtitle: string, addescription: string, adcalltoaction: string, accountadcount: string, accountdayssincefirstadcreation: string, eventtype: string, vieweruniqueuserid: string, viewercity: string, vieweradimpressioncount: string, viewertotalimpressioncount: string, viewertotalclickcount: string, viewerip: string, trackingsinceseconds: string, vieweruserid: string, edrtypename: string, edrgroup: string, edrhostname: string, edruniqueid: string]